Author: Joachim Plath
setwd("C:/Users/Joachim/Documents/BC/Atombombe")
library(RTutor)
ps.name = "understanding bank runs"
sol.file = paste0(ps.name,"_sol.Rmd")
libs = NULL
# character vector of all packages you load in the problem set
libs = c("foreign","reshape2","plyr","dplyr","mfx","ggplot2","knitr","regtools","ggthemes","dplyrExtras","grid","gridExtra","prettyR")
name.rmd.chunks(sol.file, only.empty.chunks=FALSE)
# Create problem set
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL, libs=libs,
          extra.code.file = "extracode.r", var.txt.file = "variables.txt")
# show in web browser
show.ps(ps.name, load.sav=FALSE, launch.browser=TRUE, sample.solution=TRUE, is.solved=!TRUE)
This problem set analyzes factors leading to bank runs.
It is developed from the following paper: "Understanding Bank Runs: The Importance of
Depositor-Bank Relationships and Networks", written by Rajkamal Iyer and Manju Puri.
You can download the paper from nber.org/papers/w14280
to get more detailed information.
The dataset and the Stata code can be downloaded from aeaweb.org/articles.php?doi=10.1257/aer.102.4.1414
Overview:
How to use RTutor and introduction to the issue.
Descriptive and Inductive Statistics:
Summary statistics for the whole dataset and sub groups.
General Overview and the Impact of an Insurance Cover:
Model introduction and a first probit regression.
Stata vs. R:
Differences between Stata and R: how to deal with perfect prediction.
Relation between a Loan and the Insurance Cover:
Do all depositors who are above the insurance cover run?
Importance of Bank-Depositor Relationship:
How can the bank-depositor relation be assessed?
Influence of Social Networks:
How much influence does a network have on the running decision?
Robustness Check:
Checks whether the findings are dependent on some omitted factors.
Conclusion
The first exercise is an introduction to RTutor, in order to help you learn how to deal with this problem set.
Furthermore, we will define a bank run and take a look at how the definition is reflected in our underlying dataset.
Before you start to use the RTutor html version, you need to be familiar with the interface.
In the problem set you have to solve one code chunk after the other. Consequently, you have to start with the first task of an exercise and continue step by step until its last task. Note, however, that you can work on the exercises themselves in any order, so you can choose which one you want to work on.
If you click on one of the numbered buttons on top of the page, you can skip to the related exercise. If you click on the Data Explorer
button, you will get an overview of all loaded data.
All your commands have to be written into the white fields and can be checked for correctness by clicking on the check
button.
The directions are always mentioned in the Task.
Sometimes you'll need further information to solve a task, which is always given in an info-block. In order to see the whole information, you need to click on the highlighted info-block.
The other buttons above the code field are explained in the first exercise.
Also keep in mind that:
- Previous hints will always be highlighted in italics
- functions(), packages or variables will always be highlighted in red
At the beginning of each exercise, all required data will be loaded, because RTutor doesn't recognize variables from previous exercises by default. Moreover, this gives you a better overview of which dataset is used in which exercise. Some chapters in this problem set refer to a certain part or table in the replicated paper. References to this paper are attached in brackets behind the heading of the exercise.
For this exercise, you will have to return to the homepage where you've downloaded this problem set.
Download the dataset data_for_transaction_accounts.dat
into your current working directory.
Task: Use the command read.table()
in order to read the downloaded dataset. Subsequently store it into dat_trans
.
When you're finished, click on the check
button. If you need further help, click on the hint
button, which provides you with more detailed information.
The command read.table()
reads a file in table format and creates a data frame out of it.
If you set the working directory correctly, you only need to type the name of the dataset in quotation marks:
read.table("data_for_transaction_accounts.dat")
Otherwise, you have to set the full path. E.g.:
read.table("C:/mypath/data_for_transaction_accounts.dat")
In order to store your results, proceed as follows:
data=read.table("data_for_transaction_accounts.dat")
Storing and saving your results in a variable mean the same thing. Make sure that you always proceed as shown above.
The check
button evaluates your code and checks if the answer is correct. If you check an incorrect command, a hint is automatically given. If you need help, press the hint
button in order to get additional information. The hint may give you parts of the solution or suggestions about some critical issues.
The run chunk
button processes your code but does not check it. This can be useful if you want to inspect the output of your commands or get some additional information.
The data
button shows the datasets loaded in the current exercise. To get an overview of all loaded datasets, click on the Data Explorer
button at the top of the page.
dat_trans=read.table("data_for_transaction_accounts.dat")
#< hint
display("Just type: dat_trans=read.table(\"data_for_transaction_accounts.dat\") and press check!")
#>
To get an overview of the data click on the data
-button, which shows you a description of the single variables respective to the column titles.
The dataset of dat_trans
contains all depositors who have a transaction account at the headquarters of a bank which faced a run on March 13, 2001. The bank was located in the state of Gujarat, India. The precipitating event was the default of the largest cooperative bank in the state of Gujarat. The bank had neither any inter-bank exposures to the defaulted bank nor any stock investments. Consequently, we can assume that the bank was solvent and healthy at the time when it faced the run. Moreover, the state economy was performing well. Given this scenario, we can safely assume that the run was the result of an idiosyncratic shock, so we can focus on the behavior of the depositors.
For additional information about the collected data, look at part II and III in the mentioned paper.
The phenomenon where depositors rush to withdraw their deposits because they believe the bank will fail is called a bank run. However, how much a depositor needs to withdraw in order to be counted as a runner remains open in this definition. According to Iyer and Puri, a runner is a depositor who withdraws more than 75% of his or her deposits on March 13, 2001.
The running behavior can be measured with the variables runner75
, runner50
and runner25
. In order to get a first impression of these variables, click on the data
button of the last code field and take a look at the related columns.
As you can recognize, these variables are all binary coded. This means that they take either the value of one or zero. These variables indicate whether a depositor withdraws more than 75%, 50% or 25%, respectively. To understand to what extent the definition of a runner depends on the withdrawal threshold, we compute the sum of these columns.
The following command shows you how to compute the sum of the column runner75
.
This time you only have to click on the check
button.
#< task
sum(dat_trans$runner75)
#>
Similar, we compute the sum of runner50
. Only press check
.
#< task
sum(dat_trans$runner50)
#>
Now it's your turn:
Task: Use sum
to compute the sum of the column runner25
of the dataset dat_trans
.
sum(dat_trans$runner25)
#< hint
display("Proceed as in the previous examples. You just have to adjust the column names.")
#>
As you can see, the calculated sums decrease as the withdrawal threshold increases. This is easy to explain: the depositors who withdraw more than 75% are a subset of those who withdraw more than 50%, who are in turn a subset of those who withdraw more than 25% of their deposits.
To calculate all these numbers within one command, we can use the summarise_each
function out of the dplyr
package.
For the next task, it is recommended to look at the given info block, which is given below.
The summarise_each
function is part of the dplyr
package. It applies one or more functions to the mentioned columns of a given dataset. Also, it recognizes data, which is already grouped and calculates the given functions separately for each group.
For example, if you want to calculate the sum of the variables runner75
and runner50
, which are part of the dataset dat_trans
, use:
library(dplyr)
sum_wide=summarise_each(dat_trans,funs(sum),runner75,runner50)
sum_wide
Note: whenever we say "show the output", you have to type the assigned variable into the last line. If you click on check
, RTutor evaluates all your commands and also shows the output.
If you want to learn more about how to use a certain function, it is useful to read the related pdf-file. You can quickly find them if you google the requested function. In our case you can look at:
cran.r-project.org/web/packages/dplyr/dplyr.pdf
Task: Make use of the summarise_each
function to compute the sum
of the variables runner75
, runner50
and runner25
which are part of the dataset dat_trans
. Store your result in sum_wide
.
Finally show your results, typing sum_wide
into the last line.
Previous hint: Look at the info-block to see how to use summarise_each
. Don't delete the given command, it's part of the solution.
#< task
library(dplyr)
#>
sum_wide=summarise_each(dat_trans,funs(sum),runner75,runner50,runner25)
sum_wide
#< hint
display("Only add runner25 to the given example in the info-block!")
#>
#< add_to_hint
display("Just use: sum_wide=summarise_each(dat_trans,funs(sum),runner75,...) and type sum_wide into the next line.")
#>
Now, we want to plot our results using the ggplot
function. This function needs a data-frame in the long format.
To get this long format, we use the melt
command out of the reshape2
package.
As you see, there is no task, so only click on the check
button.
melt()
can be applied to transform data into the shape that we need in order to plot graphs. melt(data,id.vars,measure.vars,variable.name,value.name)
creates a data frame based on the given id.vars
. Other variables that you want to keep in the new dataset must be given by measure.vars
. From these specifications, a data frame is created with the same number of id-columns as given in id.vars
. Additionally, there is one column each for the name of the measured variable and its value.
For further reading take a look at: had.co.nz/reshape/introduction.pdf
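The following minimal sketch (with a made-up data frame, not part of the problem set) illustrates what melt() does:
library(reshape2)
# hypothetical wide data: one row of sums per group
wide = data.frame(group = c("a","b"), runner75 = c(3,1), runner50 = c(5,2))
# every non-id column becomes a (variable, value) pair
long = melt(wide, id.vars = "group", measure.vars = c("runner75","runner50"),
            variable.name = "variable", value.name = "value")
long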
#< task
library(reshape2)
sum_long=melt(sum_wide)
sum_long
#>
Compare the variables sum_long
and sum_wide
. sum_long
has only two columns: one for the value of each column and one for the name of each column. Now this data structure can be used to plot the different sums with ggplot
.
ggplot
is a function of the package ggplot2
, which is an implementation of the so-called grammar of graphics. Commands that generate a plot all follow the same structure. The basic command ggplot
is extended by various components, which are added with the +
operator.
In our case, we use the basic command ggplot(data,aes(x,y,fill))
, which needs a dataset that contains the data we want to plot. aes
specifies the aesthetic mappings, which are passed to the plot elements. Here we need a categorical variable for the x-axis and a continuous one for the y-axis.
As mentioned above, you can add various geometries to a plot with functions starting with geom_
, using the +
operator. All available geometries and additional functions are listed at the ggplot2 webpage: docs.ggplot2.org/current/index.html
For a good introduction, look at: noamross.net/blog/2012/10/5/ggplot-introduction.html
If you've prepared your data mydata
with the melt
command, you are now able to use ggplot
as explained below in order to get a bar-graph:
library(ggplot2)
plot=ggplot(mydata,aes(x=variable,y=value,fill=variable))+
  geom_bar(stat="identity")
# you only need to type the variable into a new line to display the plot after you've pressed the check button
plot
The term "identity" means that the bars represent the values in the data. If you use "bin" instead, the bar-graph shows the number of cases for each occurring x. In that case you are not allowed to map y to a specific column.
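A tiny, self-contained illustration of the difference (the data frames here are made up for this example):
library(ggplot2)
pre = data.frame(variable = c("a","b"), value = c(3,1))   # precomputed sums
raw = data.frame(variable = c("a","a","a","b"))           # raw cases
# stat="identity": bar heights are taken directly from the value column
ggplot(pre, aes(x = variable, y = value)) + geom_bar(stat = "identity")
# the counting stat (called "bin" in older ggplot2 versions, "count" in newer ones):
# bars show how often each x occurs, so no y aesthetic is mapped
ggplot(raw, aes(x = variable)) + geom_bar()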
It is also possible to add elements to your plot later on, as long as you have stored your plot before.
plot=plot+ggtitle("MyHeading")
plot
Task: Create a bar-graph applying the ggplot
command. Make sure, that you use the variable
-column of sum_long
as x-axis and the value
-column as y-axis. Further set fill=variable
.
Don't forget to store your result in the variable plot1
and show your graph.
Previous hint: Take a look at the ggplot info-block! The needed package is already loaded for you.
#< task
library(ggplot2)
#>
plot1=ggplot(sum_long,aes(x=variable,y=value,fill=variable))+
  geom_bar(stat="identity")
plot1
#< hint
display("Create the graph as described in the info block, but instead of mydata use sum_long. Show the graph by typing plot1 into last line.")
#>
#< add_to_hint
display("Just type plot1=ggplot(sum_long,aes(x=variable,y=value,fill=variable))+ geom_bar(stat=\"identity\"). Finally, type plot1 into the last line.")
#>
The graph shows you the sum of all runners depending on the threshold. With an exact value of 307, the number of depositors who withdraw more than 75% of their deposits seems to be very small compared to the 10691 observations in our dataset.
The difference between the sums of runner25
and runner75
shows that most of the runners withdraw more than 75%. We could interpret the level of withdrawals as a measure of panic: the more a depositor withdraws, the more panicked he or she is. Therefore, most of the people who withdraw seem to be driven by panic. Even though the percentage of runners according to the 75% threshold is only 2.87%, this goes hand in hand with the fact that even a small fraction of depositors can cause a bank run. These numbers are quite similar to other bank runs: e.g. the run on the IndyMac bank was caused by less than 5% of the depositors.
To get a better understanding of the ggplot graphs, we want to polish our plot. The graph is still missing an explanation of what we see. Moreover, the label of the y-axis should be "sum" instead of "value".
Task: Set a heading by adding ggtitle("Number of Runners depending on the running level\n")
to your existing plot using the +
operator.
Make sure that you don't forget to store your results again in plot1
and show plot1
afterwards.
plot1=plot1+ggtitle("Number of Runners depending on the Running Level\n")
plot1
#< hint
display("Look at the second code example of the infoblock \"ggplot\"!")
#>
#< add_to_hint
display("To create the plot, you only need to type plot1 +ggtitle(\"Number of Runners depending on the Running Level\n\"). Don't forget to store and show your results!")
#>
The "\n" at the end of the heading creates a newline after the heading. This makes the plot less squeezed.
Task: Label the y-axis of plot1
. To do so, add ylab("Sum of Runners")
with the +
operator to plot1
. Show the plot immediately and don't store your results.
plot1+ylab("Sum of Runners")
#< hint
display("Just look at the info-block of ggplot!")
#>
After getting more familiar with the term bank run, we now want to take a closer look at our dataset. We further want to examine factors which influence the running decision.
As mentioned in the introduction, we will load the dataset which is the basis of our analysis.
These loadings will be done automatically by first clicking on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
#>
In this part, we want to understand the structure of the underlying data. The structure is easier to understand if we visualize some key characteristics of the data. To get a first overview, we are going to compute summary statistics containing the mean, the standard deviation, and the number of observations that do not contain NAs.
Task: Apply the describe
function to your dataset dat_trans
. Set num.desc=c("mean","sd","valid.n")
.
Previous hint: Since the required package is loaded, you only write your command into the subsequent lines.
describe
needs a data frame as a first input parameter. It computes the measures given by num.desc
for each column. This functionality looks similar to the sapply
function, which also applies given functions to all columns of the dataset. Indeed, the describe function internally calls sapply
.
A code example of how to use it is provided below:
describe(dat_trans,num.desc=c("mean","sd","valid.n"))
For additional parameters and other summary commands, take a look at the pdf-file of the prettyR
package: cran.r-project.org/web/packages/prettyR/prettyR.pdf
NA is a logical constant of length one, which contains a missing value indicator.
It simply indicates that there is no entry for the selected item.
describe
accounts for the NAs if you use valid.n
#< task
library(prettyR)
#>
describe(dat_trans,num.desc=c("mean","sd","valid.n"))
#< hint
display("Just copy the example of the info-block. You don't need to store your results.")
#>
#< add_to_hint
display("Just type: describe(dat_trans,num.desc=c(\"mean\",\"sd\",\"valid.n\"))")
#>
If we now want to interpret these results, we have to bear in mind what each variable describes and how it is scaled. Some variables are transformed from their original meaning in order to obtain estimates which can be interpreted more easily. For example, the opening balance on the day of the run is counted in hundreds of Rs.; therefore, the average opening balance was Rs. 3259. Some of the variables don't make sense to interpret but are shown because we don't want to inflate the problem set with select
-commands.
E.g. the address is simply a number, which can be set in various ways and can't be interpreted.
Now that we have a rough overview, we need to think about how the different variables affect the running behavior, which is the core of our analysis. To accomplish that, we divide our observations into runners and stayers according to the 75% threshold.
To do that, we add a new column to our dataset, called type
to which we assign the value runner
if runner75
equals one and stayer
if runner75
equals zero. This will make our commands easier and the legend of our plots will be more intuitive. As this task is already accomplished for you, you only need to click on the check
button.
#< task
dat_trans$type=ifelse(dat_trans$runner75==1,"runner","stayer")
#>
Task: Use the group_by
command out of the dplyr
-package to group dat_trans
by type
.
Don't forget to store your result in grouped_dat
.
group_by()
is part of the dplyr
package. It takes the data and converts it into grouped data. The grouping should be done by categorical variables and can involve multiple variables. Subsequent operations on the data will then be carried out on the grouped data.
An example will show you how to use the command:
library(dplyr)
# group data by only one column
group_dat=group_by(dat_trans,type)
#< task
library(dplyr)
#>
grouped_dat=group_by(dat_trans,type)
#< hint
display("This command has only two input-parameters: dat_trans and type. Don't forget to store your computations.")
#>
#< add_to_hint
display("Only type: grouped_dat=group_by(dat_trans,type).")
#>
In the next step, we want to visualize the means of the groups. We don't want to look at the whole dataset, because taking the means of some variables doesn't make sense, as shown in the example of the adress
variable. Therefore, we only take a subset consisting of: minority_dummy
, above_insurance
, loanlink
, avg_deposit_chng
, avg_withdraw_chng
, opening_balance
, ln_accountage
, avg_transaction
. All of these variables are candidates for having an impact on the running decision. For the economic reasoning behind the selected variables, take a look at the following info-block.
When analyzing bank runs, we have to think about economically reasonable factors which could drive depositors to run. Deposit insurance is a widely used instrument to prevent bank runs. For example, the US raised the insurance limit from $100,000 to $250,000 during the financial crisis. Consequently, we take the deposit insurance into account using the variable:
above_insurance
The relation between bank and depositor could also affect a run. The more intensive this relation is, the more information a depositor can gain about the health of the bank. This relation is measured by a set of variables:
- loan_linkage
- ln_accountage
- transactions
Another factor, which could lead the depositor to run, is herd-behavior. We measure this phenomenon by defining a minority. As most of the people in India are Hindu, Muslims are defined as the minority. We measure how the affiliation to a certain group affects the running decision through:
- minority_dummy
The amount of money which a depositor has in his account on the day of the run is another crucial factor. We have already accounted for the insurance cover effect; therefore, we only look at the balance if the amount is smaller than the insured cover. This leads to the variable:
- opening_balance
Task: Apply the summarise_each()
function to calculate the mean
for each of the following variables:
minority_dummy
, above_insurance
, loanlink
, avg_deposit_chng
, avg_withdraw_chng
, opening_balance
, ln_accountage
and avg_transaction
.
Make sure that you don't forget to save your result into the variable mean_wide
and show the output.
Previous hint: This time you can see a part of the command displayed in green. Delete all of the # in front of the commands and complete these given commands.
#< task_notest
# Only replace the ??? with the mentioned function and delete the #s
# mean_wide=???(grouped_dat,funs(mean),minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
# mean_wide
#>
mean_wide=summarise_each(grouped_dat,funs(mean),minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
mean_wide
#< hint
display("Only replace the ??? with summarise_each.")
#>
Our aim is to visualize the calculated means using the ggplot
-function. Remember that ggplot
needs one column for the x-axis which should be categorical and representative for the different variable names and one for the y-axis, in our case the calculated means.
Task: Use the melt()
command to melt mean_wide
with "type"
as id-variable. Remember to store your command in mean_long
for further purposes and show your results.
#< task_notest
# Only adapt the ??? in the command below and delete the #s.
# mean_long=melt(mean_wide,id="???")
# mean_long
#>
mean_long=melt(mean_wide,id="type")
mean_long
#< hint
display("Proceed as in Exercise 1.3!")
#>
#< add_to_hint
display("Did you forget to put the id-variable in quotation marks?")
#>
The format of the returned tables looks very similar to the tables of exercise 2. All columns that contain numerical values are transformed into a single column, with the former column title as row label. Furthermore, we have now set an id-variable, which is displayed in the first column for every value of the non-id variables. The length of the table depends on the number of groups: $\text{length} = \#\text{groups} \cdot \#\text{columns}$, where $\#$ denotes the respective number.
In the next step, we want to visualize our results to get a better understanding of the different characteristics of the groups. We are especially looking for variables that have discrimination power, which means that the difference between runners and stayers is large.
For this purpose, we draw a bar-graph, which you need to refine later on.
For this time, you only need to click on the check
button.
#< task
# this is the basis command
plot2=ggplot(mean_long,aes(x=variable,y=value,fill=type))+
  # you need position_dodge() to draw the bars beside each other
  geom_bar(stat="identity",position=position_dodge())+
  # -> info-block
  geom_text(aes(ymax=0,y=value/2,label=round(value,3)),position = position_dodge(width=1),vjust=-0.25,size=3)+
  # -> info-block
  facet_wrap(~variable,scale="free")+
  xlab("")+
  ylab("")+
  ggtitle("Grouped Means\n")
plot2
#>
Faceting partitions a plot into a matrix of panels. facet_wrap(~variable)
creates a single panel for each variable. This is useful if you have data with different scales. Therefore, you should not compare the different panels but the plots within the panels. If you further add scale="free", the scales are adjusted and each panel has its own scale.
For additional information, take a look at:
sape.inf.usi.ch/quick-reference/ggplot2/facet
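A small sketch (with made-up data, for illustration only) of how facet_wrap() separates variables that live on very different scales:
library(ggplot2)
d = data.frame(variable = rep(c("a","b"), each = 2),
               type = rep(c("runner","stayer"), 2),
               value = c(0.1, 0.2, 100, 300))
ggplot(d, aes(x = type, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free")   # one panel per variable, each with its own scale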
geom_text
is useful if you want to add labels to your graph. It needs the position defined by aes(x=positionx,y=positiony,label=value)
and the label, which is drawn into your plot.
A general help for different variations of the ggplot
functions can be found at:
docs.ggplot2.org/current/geom_text.html
Before we start interpreting our result, take a look at the plot above. There you can recognize that each panel is labeled twice: At the top and at the bottom. For this reason, we delete the labels of the x-axis.
Task: Display plot2
and make use of the command scale_x_discrete(breaks=NULL)
using the +
operator in order to delete the labels of the x-axis.
plot2+scale_x_discrete(breaks=NULL)
#< add_to_hint
display("Just type plot2+scale_x_discrete(breaks=NULL) to get the graph.")
#>
Now we turn to interpreting the plotted bars:
Remember that we search for variables which have a large power of discrimination. Regarding the decision to run, the above_insurance
variable has the largest impact: the fraction of depositors who are above the insured Rs. 100,000 is nearly 20 times higher in the runner group. This striking difference can be explained easily: the amount above the insurance cover is at stake in case of a default of the bank. Consequently, a rational depositor should run if he or she is above the cover. Nevertheless, we see that 0.7% of the stayers are above the cover as well. If we find an explanation for this kind of behavior, we might be able to keep depositors who are above the insurance cover from running.
We also recognize that the deposit balance (opening_balance
) is much higher for runners than for stayers on the day of the run. This phenomenon is consistent with our explanation of the insurance cover.
In a nutshell: the more we have, the more we can lose.
This pattern also confirms that even a small number of runners can have a large impact on the solvency of the bank, provided that the runners are rich enough.
Another factor which has a huge impact on stayers but relatively little effect on runners is the loanlink
variable. A depositor who has an outstanding loan at the bank will have more contact with the bank staff than a depositor who only stores his money at the bank. Through this relation, he or she might gain information which strengthens his or her opinion about the health of the bank.
After taking a closer look at the calculated means, we further want to check how significant these differences are. This validation can be done through a two-sample t-test. In this case, we conduct a t-test with different standard deviations and unpaired samples. A large t-statistic provides evidence against the null hypothesis.
Consider two unpaired samples $(X_{11},...,X_{1N_{1}})$ and $(X_{21},...,X_{2N_{2}})$. Unpaired means, that not only the single observations within the samples, but also the samples themselves are independent.
The two-sided two-sample t-test checks whether the difference of the means $\mu_{1}$ and $\mu_{2}$ of the two samples is unequal to 0. It assumes normally distributed and independent observations, which enables us to write:
$$
H_{0}: \left | \mu_{1}-\mu_{2} \right |= 0 \; \; \; \; \; \; vs. \; \; \; \; \; \; H_{1}: \left | \mu_{1}-\mu_{2} \right |\neq 0
$$
In case of unknown standard deviations the test-statistic is calculated as follows:
$$
t=\frac{\bar{x}_{1}-\bar{x}_{2}}{\sqrt{\frac{\hat{\sigma}^2_{1}}{N_{1}} + \frac{\hat{\sigma}^2_{2}}{N_{2}}}}
$$ where $\hat{\sigma}^2$ denotes the estimator of the variance and $\bar{x}$ denotes the estimator for the expectation.
The test-statistic then is approximately Student-t-distributed.
For further reading, take a look at: Greene (Econometric Analysis, 2008) - Chapter 16, p.500-502 Estimation Methodology
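As a cross-check (not part of the replication), the same kind of Welch two-sample statistic can be obtained with base R's t.test(); this assumes dat_trans has been read in as above:
# Welch two-sample t-test (unequal variances, unpaired), e.g. for opening_balance grouped by runner75
t.test(opening_balance ~ runner75, data = dat_trans, var.equal = FALSE)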
To conduct a t-test, use the sapply()
function and then apply a subset of dat_trans
to the function.
We use the function TTest()
, which shows the elements of the t-test in a compact format.
This time you only need to take a look at the command, but later on you'll do this task on your own.
Therefore, we subset the variables measured in our bar-graph before.
You don't need to compute this time, just click on check
.
The select(data,col1,col2)
command is part of the dplyr
-package, which extracts certain columns out of the given dataset.
It thus returns a subset of the original dataset.
#< task
# we overwrite the select function since it is defined in several packages
select <- dplyr::select
subset1=select(dat_trans,runner75,minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
#>
The next step is conducting the test. This time you don't need to type in the right command. Only press the check
-button. But please acknowledge that you have to do it on your own in the fourth exercise.
sapply()
is a function which is very useful for data manipulation. It has two input parameters: sapply(data,FUN)
where data should be a data frame or a list and FUN a function that is applied to each column of data. The function returns one result per column of the input data.
For further information, look at
ats.ucla.edu/stat/r/library/advanced_function_r.htm
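A minimal, self-contained illustration of sapply() on a toy data frame:
sapply(data.frame(a = 1:3, b = 4:6), mean)
# returns one result per column: a = 2, b = 5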
The function TTest
has two parameters of which the first one is the sample to be grouped; the second one is the grouping variable. To avoid inconsistencies, the parameters should be vectors from the same dataset. For visualizing purpose, we only want to show the estimated mean, the p-value and the t-statistic.
#< task
t(sapply(subset1[-subset1$runner75],function(x) round(TTest(x,subset1$runner75),3)))
#>
By looking at the p-values in the bottom table, we see that all variables except minority_dummy
and avg_deposit_chng
have significant differences between runners and stayers at the 1% level. A p-value below 1% means that, if the variable did not systematically differ between runners and stayers in reality, the probability of finding differences as extreme as (or more extreme than) in our sample would be below 1%. The size of the t-statistic depends on the difference of the means and on the standard deviations. From the bar graph and the given statistics, we can verify this statement by looking at the above_insurance
variable and the opening_balance
variable: the heights of the related bars are so different that we can assume a large t-statistic, if the standard deviations are not too big. Indeed, the standard deviation of above_insurance
is smaller than one, which increases the t-statistic.
These findings strengthen our guess, that the selected variables have an impact on the running decision.
The rejection area is monotonically increasing in the significance level: the larger the significance level, the larger the rejection area. The p-value is the largest significance level for which the null hypothesis that the mean difference equals 0 is not rejected (in case of a two-sided test). Rejection occurs when the test statistic is larger than the related quantile. Consequently, one must reject the null hypothesis if the significance level is larger than the p-value. Therefore, a small p-value is evidence for the alternative hypothesis. As a rule of thumb, the p-value is about 5% if the t-statistic is around +/- 1.96.
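To illustrate the rule of thumb, the two-sided p-value belonging to a given statistic can be approximated with the standard normal distribution (a reasonable approximation for the large samples used here):
t.stat = 1.96
2 * (1 - pnorm(abs(t.stat)))   # roughly 0.05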
After we have made up a general overview of the data, we have to go one step further and think in a little more abstract ways about the running behavior. A depositor runs because he thinks that the bank will go insolvent. His or her opinion can be influenced by two different sources. Firstly, the information he or she has about the health of the bank and secondly, the information he or she got from the behavior of others. We will examine these two sources of information, starting with the personal information.
As in Exercise 2, we will load the dataset on which we base our analysis.
These loadings will be done automatically, so that you need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
#>
First of all, we have to think about an appropriate model. Bear in mind that we want to model the running behavior, which is expressed through the binary variable runner75
. This leads us to the so called probit approach, which models the running probability p with the standard normal distribution: $\mathbb{P}(runner75=1|x)=\Phi (x^{\top }\beta)$.
Think about a model, which is appropriate to measure the impact of some factors on the running decision. Bear in mind that the decision to run is binary coded through the variable runner75
. We therefore need a model, which links the decision to run to a set of factors, like we do it in a regression. Our approach will be to analyze each of these factors in the framework of probability models:
$$ Prob(depositor \; runs)=Prob(runner75=1)=F(relevant \; \mathit{effects},\mathit{parameters}) $$
We assume that multiple factors x explain the decision to run. Therefore we can write:
$$ \mathbb{P}(runner75=1|x)=F(x,\beta ) $$ $$ \mathbb{P}(runner75=0|x)=1-F(x,\beta) $$
The set of parameters $\beta$ reflects the impact of changes in $x$ on the probability.
The only thing we need to find is an appropriate model for the right hand side of the equation.
A first thought may be to retain a linear regression model:
$$
F(x,\beta )=x^{\top }\cdot \beta
$$
One problem is, that this function isn't constrained to the 0-1 interval.
Our requirement is a model, which produces predictions that are consistent with the following thoughts:
The model should predict a high running probability for the depositors that
run and a low running probability for the depositors that stayed.
The standard-normal distribution fulfills all our requirements and is therefore an appropriate link-function $F$.
We therefore are able to write:
$$
F(x,\beta)= \Phi (x^{\top }\cdot \beta )=\int_{-\infty }^{x^{\top }\cdot \beta } \phi (t)\, dt= \int_{-\infty }^{x^{\top }\beta } \frac{1}{\sqrt{2\pi }} \exp\left(-\frac{1}{2} t^2\right) dt
$$
We call F a link function, because it links the linear combination $x^{\top}\beta$ of the factors to a probability.
Our estimates of $\beta$ will be based on the method of maximum likelihood. Each draw of runner75
is treated as an independent draw of a Bernoulli distribution.
The likelihood-function for a sample of N depositors, can be written as:
$$ L=\prod_{i=1}^{N} F(x_{i},\beta )^{runner75_{i}}\cdot \left(1-F(x_{i},\beta )\right)^{1-runner75_{i}} $$
This common density is maximized with respect to $\beta$, which leads us to the problem that the first-order condition can't be solved analytically. Therefore, we use Newton's method, which usually converges to the maximum of the likelihood in just a few iterations.
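The following sketch (not needed for the problem set) illustrates the idea numerically: it maximizes the probit log-likelihood for a single explanatory variable with the general-purpose optimizer optim() and compares the result to glm(). It assumes dat_trans has been read in as above.
# keep only complete cases for the two variables used here
cc = complete.cases(dat_trans[, c("runner75","above_insurance")])
y = dat_trans$runner75[cc]
x = dat_trans$above_insurance[cc]
# Bernoulli log-likelihood with probit link: sum of y*log(Phi(x'b)) + (1-y)*log(1-Phi(x'b))
loglik = function(beta) {
  p = pnorm(beta[1] + beta[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
# optim() minimizes, hence the sign flip; the result should be close to glm()'s coefficients
opt = optim(c(0, 0), function(b) -loglik(b))
opt$par
coef(glm(runner75 ~ above_insurance, family = binomial(link = "probit"), data = dat_trans))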
The interested reader should take a look at the following book for further reading: Greene (Econometric Analysis, 2008) - Chapter 23, p. 771 ff. Models for Discrete Choice
Task: Use the glm()
command to regress runner75
against: minority_dummy
, above_insurance
, opening_balance
, loanlink
, ln_accountage
, avg_transaction
, avg_deposit_chng
, and avg_withdraw_chng
from the dataset dat_trans
. Don't forget to store the regression output in the variable reg1
.
Previous hint: Further you can delete all the # before the given command and then adapt it!
glm()
is used to fit generalized linear models. If we want to compute a probit regression, we have to set family=binomial(link="probit")
. The following example explains it best:
reg=glm(runner75~minority_dummy+above_insurance+opening_balance,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
The formula notation has the following meaning: regress runner75 on a linear combination of minority_dummy, above_insurance and opening_balance. All the variables mentioned in the formula must be columns of the data frame dat_trans. As link function, the standard normal distribution function is used. na.action
is an option, which decides how to deal with NA's. If we use na.action=na.omit
, all observations in which NAs occur are dropped.
#< task_notest
# Delete all the # and insert the needed data-frame for the ???
# reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=???,na.action=na.omit)
#>
reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
#< hint
display("Just copy the formula from the example, only delete the ??? and replace it with dat_trans from Exercise 1.")
#>
#< add_to_hint
display("Just type: reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link=\"probit\"),data=dat_trans,na.action=na.omit)")
#>
In order to get a better understanding of the influence of single variables, we want to show the marginal effects instead of the coefficients, which are reported by default by the glm()
function. Furthermore, we want to compute robust standard errors to get a more reliable level of significance.
All these features can be computed with the showreg()
command.
The function showreg
is a useful function to visualize the most common statistics of several regression-outputs.
It can be best explained by the command below:
showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
digits sets the number of digits after the decimal point that should be shown, and omit.coef drops coefficients whose names match the given pattern; several patterns can be combined with the | operator. In order not to overwhelm you, the showreg() command is computed for you.
In the following computation, we take care of the mentioned features. Further, we don't show the intercept, and for more clarity we round all our results to three decimal places.
The only thing you have to do is to click on the check
button.
#< task
library(regtools)
showreg(list(reg1),robust=c(TRUE),robust.type="HC0",coef.transform=c("mfx"),digits=3,omit.coef="(Intercept)")
#>
To interpret each variable, we first describe the output in general. The marginal effect and the p-value represented by stars are reported in the first row. You can see how the p-value and the stars are related at the bottom of the table. In the second row, the robust standard errors are shown in parentheses.
The first two rows at the bottom of the table show measures of the relative quality of the statistical model: the popular AIC and BIC. Interpreting these measures only makes sense if we have another model on the same dataset to which we can compare them. The third row indicates the log-likelihood, i.e. the maximized common log density at the estimated coefficients. Because the common density is restricted to the [0,1] interval, its logarithm is always negative. The fourth row shows the deviance of the model, which is also a measure for comparing models on the same dataset; it measures the goodness-of-fit of the ML-estimated model compared to the null model. The last row displays the number of observations.
A first result is the effect of the insurance cover. If a depositor is above the insurance cover, the probability of a run increases by 32.9 percentage points. Furthermore, this result has a relatively small standard error, which explains the high level of significance. This supports the conclusion that deposit insurance reduces depositors' panic. But if we take a closer look at depositors below the insurance cover, a rise in the opening_balance
seems to increase the likelihood of running. Even though these depositors are below the insurance cover, there are some who decide to run.
Second, we recognize that the depositor-bank relationship matters. The length of this relation is measured by ln_accountage
, which is highly significant. The depth of the relationship is measured by the loanlink
variable, which has the third largest influence on the probability of running. Both of these variables have a negative marginal effect, which means that the larger they are, the smaller is the probability of a run.
Marginal effects can be calculated in various ways, but generally there are two common methods of computing them.
Note that the marginal effects are calculated differently for continuous and binary variables.
Consider the probit model with $k$ explanatory variables $x$: $\Phi (\alpha +\sum_{i=1}^{k} \beta_{i}x_{i})$
1. The first method is called marginal effect at the mean (MEM).
- For continuous variables $x_{j}$, the MEM is given by:
$$
MEM_{j}=\beta_{j}\,\phi \left(\alpha+\sum_{i=1}^{k}\beta_{i}\bar{x}_{i}\right)
$$
- For binary variables $x_{j}$, the MEM is:
$$
MEM_{j}=\Phi\left(\alpha+\sum_{i=1}^{k} \beta_{i}\bar{x}_{i} \,\middle|\, x_{j}=1\right) - \Phi\left(\alpha+\sum_{i=1}^{k} \beta_{i}\bar{x}_{i} \,\middle|\, x_{j}=0\right)
$$
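As an illustration (not part of the replication), the MEM of a continuous variable can be computed by hand from reg1; packages such as mfx use closely related formulas. This assumes reg1 has been estimated as above:
X = model.matrix(reg1)                 # design matrix including the intercept column
b = coef(reg1)
xb.bar = sum(b * colMeans(X))          # index x'beta evaluated at the variable means
b["opening_balance"] * dnorm(xb.bar)   # MEM_j = beta_j * phi(index at the means)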
Conventionally, we have the following relationship between the stars and the p-value:
- * : p-value below 5%
- ** : p-value below 1%
- *** : p-value below 0.1%
These values are calculated on the basis of a two-sided t-test for the coefficients. The test statistic for a coefficient $\beta_{i}$ is computed as follows: $t_{i}=\frac{\beta_{i}}{\sqrt{Var(\beta_{i})}}$, where $t_{i}$ is Student-t distributed.
One problem, which occurs if we want to interpret the effect of continuous variables on the running probability, is that the marginal change of a continuous variable is hard to imagine. Therefore, we will now take a look at a so called effectplot
, which calculates changes in the running probability in a more intuitive way.
Task: Use the raw formula of the effectplot()
-function as described in the info block and plug in reg1
.
Previous hint: Adapt the given code. Therefore delete the # before the command to use the code.
The function effectplot(reg=myreg,numeric.effect="10-90")
is part of the regtools
-package. The first parameter is a regression object, such as one returned by glm()
. The second input is the underlying dataset, which should be a data frame. The last parameter is of type string and contains the quantiles which are plugged into the probability function.
It helps to compare the magnitudes of the influence of different explanatory variables. The default effect is "10-90", i.e. the effect of changing a (numeric) explanatory variable -ceteris paribus- from its 10%- to its 90%-quantile.
The following code examples explain some of the parameters of the function in more detail:
# basic call with the default quantile effect
effectplot(reg1,numeric.effect="10-90")
# by default, add.numbers is set to TRUE; to hide the numbers on the bars:
effectplot(reg1,numeric.effect="10-90",add.numbers=FALSE)
# set a heading (the \n is used not to squeeze the plot)
effectplot(reg1,numeric.effect="10-90",main="MyHeading\n")
# ignore certain explanatory variables
effectplot(reg1,numeric.effect="10-90",ignore.vars="above_insurance")
# show confidence intervals
effectplot(reg1,numeric.effect="10-90",show.ci=TRUE)
#< task_notest
# only replace the ??? with the mentioned regression
# effectplot(???,numeric.effect="10-90")
#>
effectplot(reg1,numeric.effect="10-90")
#< add_to_hint
display("Type: effectplot(reg1,numeric.effect=\"10-90\")")
#>
Finally, things are getting much clearer: the result shows the change in the running probability if we move from the 10%-quantile to the 90%-quantile of the related variable and set the other variables to their means. Thus, the effect of a change in a regarded variable becomes more intuitive than just looking at the derivative, which only reports marginal changes. For example, the interpretation of the effect of the opening_balance
is now more intuitive:
If a depositor's balance on the day of the run rises from Rs. 124 to Rs. 6330, his or her running probability rises by 0.75%.
We further see that the loanlink
and ln_accountage
are highlighted in red. These two factors are the only influences, which reduce the running probability.
In the next step, we want to extend our calculated regression as in the replicated paper. We first want to include the variable travel_costs
and second control for the variable ward
.
This part deals with the problems, which occur if you want to replicate Stata regressions with R.
As usual, we load the needed data. We further need the regression from the last exercise.
Just click on edit
first and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
#>
If we control for a variable, we want to check whether the controlled variable has an impact on the regression. For this purpose, one adds the variable one wants to control for to the regression formula and checks whether there are any striking differences in the other coefficients.
In our case, we will only control for the variable ward
. This variable is a discrete number ranging from 1 to 88. Every number represents one ward of the town where the bank was located.
To get a better measure of the impact of each ward, we create a dummy variable for each ward. This means that we introduce 87 binary variables and select one ward as the reference class.
If you are interested in reading more, visit stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables
Including travel_costs
in addition takes the possible influence of the distance into account.
One could argue that the farther away a depositor lives, the less inclined he or she is to run. He or she might also weigh the benefit against the costs of getting his or her money. This argumentation will accompany us in the next exercise.
Further we could imagine, that the running decision is dependent on a specific ward. Maybe some people in a certain ward do have better information than others and thus don't run. Also the behavior of runners in one ward could influence the other depositors in this ward. This effect is measured by the variable ward
.
Before we start to regress the dependent variable on our new set of variables, we have to prepare the original dataset. For a better measuring of the variable ward
's impact, we create dummy variables for each ward
.
Task: Apply the function factor()
to the column ward
of the dataset dat_trans
. Operate on the single column with the $
-operator.
Previous hint: In this task you transform the original dataset, thus store your results in dat_trans$ward
.
The function factor()
is used to encode a vector as a factor.
As single input it needs the vector to be factored.
You will need this formula to prepare a categorical variable in your dataset so that a later called regression formula like glm()
creates a dummy-variable for each category.
If you want to factorize a single column of your dataframe, use:
dat_trans$ward=factor(dat_trans$ward)
With the $
operator, you can select a single column out of a data-frame.
It can be best explained by a code example:
single_column=mydata$col1
The return value assigned to the variable single_column is of the type of the single column and not of the type of the whole data frame.
dat_trans$ward=factor(dat_trans$ward)
#< hint
display("Look at the \"factor\" info-block.")
#>
#< add_to_hint
display("Type: dat_trans$ward=factor(dat_trans$ward)")
#>
After the preparation we now want to conduct the regression, which leads us to the following problem:
Stata vs. R: Since we try to replicate the paper, we have now come to a crucial point. If you run a regression that includes the factorized ward variable in Stata, it gives you the following warning: ward17 != 0 predicts failure perfectly - 14 obs not used. What this means can be shown with the following code:
#< task
X=dat_trans # (1)
X$ward=as.factor(X$ward) # (2)
M=model.matrix(runner75~ward-1,X) # (3)
M=cbind(model.frame(runner75~ward,X)[1],M) # (4)
M=M[order(M[,1],decreasing=T),] # (5)
ex=M[,c("runner75","ward17")] # (6)
ex[ex$ward17==1,] # (7)
coef(glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward,data=X,family=binomial(link="probit"),na.action=na.omit))["ward17"] # (8)
#>
If this code may look strange to you, it can be briefly explained:
(1): First we create a copy of the dataset dat_trans
.
(2)+(3): Then we factorize the ward variable and construct a dummy variable for each ward in the town, so that we can control for its effect on the other estimates.
(4): Next we construct a dataset consisting only out of the dependent variable and the dummy variables.
(5)+(6): We sort this dataset according to the dependent variable and extract the ward17
column and the dependent variable.
(7): Finally, we show only the cases where the ward17 dummy takes the value of 1. We see that the dependent variable is always 0 in these cases. We could say: if the ward17
dummy equals 1, it perfectly predicts runner75
to be zero.
(8): If we look at the estimated coefficient of the ward17
variable, we see that it is extremely large. A coefficient of -3.81 for a dummy variable means, that if the value of the dummy is one, the probability of a run is sharply decreasing.
Stata automatically drops all of these variables. Thus, to fully replicate the paper, we need a function which drops all the perfect predictors.
I wrote a function called binary.glm
, which does exactly what Stata does in case of perfect prediction: the perfectly predicting dummy variable is deleted, together with all observations for which it predicts the dependent variable perfectly. The output shows the names of the dropped variables.
If one wants to compute standard errors clustered at a variable later on, one has the option to set the input parameter clustervar1
.
To get all the explanatory variables plus the cluster variable in the underlying data frame of the regression, use model.frame()
.
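binary.glm() is a custom helper shipped with this problem set, so its internals are not shown here, but the kind of perfect prediction it screens for can be spotted with a simple cross table. A small sketch, assuming dat_trans is loaded and ward has been factorized as above:
# if one of the four cells is empty, the ward-17 dummy predicts runner75 perfectly
table(ward17 = dat_trans$ward == "17", runner75 = dat_trans$runner75)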
Task: Use the function binary.glm()
to regress runner75
on minority_dummy
, above_insurance
, opening_balance
, loanlink
, ln_accountage
, avg_transaction
, avg_deposit_chng
, avg_withdraw_chng
, ward
and travel_costs
. Also add the adress
variable as a cluster variable and display the dropped variables.
Store your command in the variable reg2
.
Previous hint: Delete the # before the green-inked command and operate on this command.
binary.glm(formula,link,data,clustervar1,show.drops)
has three obligatory parameters and two optional ones. The first parameter is used as in a general glm
formula. The second parameter is the link function of the binary regression and is of type string; it can be either "probit" or "logit". The data can be the original dataset, so one doesn't need to pass a subset of the data. clustervar1
is optional and gives the name of the variable in the dataset on which one later wants to cluster; its type is also string. The last parameter is of type boolean. If set to TRUE, the function prints all the perfect predictors.
We further give some code examples of binary.glm() to make the following tasks easier.
# full call with a cluster variable, showing the dropped variables
reg=binary.glm(runner75~minority_dummy+above_insurance+opening_balance,link="probit",data=dat_trans,clustervar="adress",show.drops=TRUE)
# without a cluster variable
reg=binary.glm(runner75~minority_dummy+above_insurance,link="probit",data=dat_trans,show.drops=TRUE)
# with show.drops=FALSE the dropped variables are not printed
reg=binary.glm(runner75~minority_dummy+above_insurance,link="probit",data=dat_trans,clustervar="adress",show.drops=FALSE)
#< task_notest
# This time a code example is given. You only need to adjust the ??? with the correct Boolean.
# reg2=binary.glm(formula=runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=???)
#>
reg2=binary.glm(formula=runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=TRUE)
#< hint
display("Very often you make typing faults. To avoid them, use the given command.")
#>
After having calculated and adjusted the regression, we now want to visualize our results.
It would be favorable to show both regressions in one table so that we can check if the marginal effects changed in the second regression.
Task: Now use the command showreg()
to get a summary table of your calculated regressions: reg1
and reg2
.
Previous hint: Proceed as in the given example.
#< task_notest
# Replace the ??? with the second regression computed above, to get the regression table:
# showreg(list(reg1,???),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#< hint
display("only use reg2 instead of the ???")
#>
#< add_to_hint
display("Use: showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type=\"HC0\",coef.transform=c(\"mfx\",\"mfx\"),digits=3,omit.coef=\"(Intercept)|ward\") ")
#>
We see that the differences of the marginal effects are very small, if we add the explanatory variables travel_costs
and ward
. This means that these two variables don't seem to change our findings. The significance levels don't change dramatically either. We could say that our results from reg1
are robust to these influences.
Out of the table we recognize two important factors:
1. The effect of the insurance cover on the running probability is the largest and also highly significant.
2. The negative effect of the loan linkage is the second largest, with a very small p-value
Think about an economic explanation for these findings. The effect of an insurance cover seems clear: if one is insured, there is no incentive to run. The impact of the loan linkage isn't as clear to us. Also, the relation between these two effects should be investigated more intensively. So, in the next subchapter we focus on the relation between these two influences.
Remember that the ML estimation tries to estimate the true density of the dependent variable, $p$. The Kullback-Leibler divergence is a measure of the difference between the true density $p$ and the estimated density $\hat{p}$. Intuitively, AIC and BIC try to overcome the difficulty of comparing an estimate $\hat{p}$ with the real $p$, because in real life we don't know the true model. Abstractly speaking, we want to measure the information loss produced by taking the estimated model instead of the true model. The better the model, the smaller this difference. It follows that the smaller the AIC or BIC, the better the model.
For a brief and general overview, you can look at: en.wikipedia.org/wiki/Akaike_information_criterion
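These quantities can also be extracted directly in R (a small illustration, assuming reg1 from the previous exercise):
AIC(reg1)      # Akaike information criterion
BIC(reg1)      # Bayesian information criterion
logLik(reg1)   # maximized log-likelihood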
Now we know that having a loan linkage decreases the running probability of a depositor.
On the other hand, we know that being above the insurance cover leads to a large increase in the running probability.
It seems that these two variables work in opposite directions.
Therefore it would be interesting to know, if a depositor who is above the insurance cover might not run if he had a loan relation. For this purpose, we introduce two variables: uninsured_rel
and uninsured_no_rel
.
Before you start, load the needed data. For this purpose, first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
org_dat=dat_trans
#>
If we want to examine the combined effect of having a loan linkage (or not) while being above the insurance cover, we introduce two binary coded variables (a sketch of how such dummies could be constructed follows below):
- uninsured_rel
: the depositor is above the insurance cover and has a loan linkage
- uninsured_no_rel
: as uninsured_rel
, but without a loan linkage
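The dataset already contains these two variables; the following sketch is only meant to illustrate how such interaction dummies could be built. It assumes that above_insurance and loanlink are 0/1 coded in dat_trans, which matches how we use them in the regressions (the _demo names are made up so that nothing in the dataset is overwritten):

# Hypothetical construction of the two interaction dummies (illustration only):
# uninsured and with a loan linkage
dat_trans$uninsured_rel_demo    = as.numeric(dat_trans$above_insurance==1 & dat_trans$loanlink==1)
# uninsured and without a loan linkage
dat_trans$uninsured_no_rel_demo = as.numeric(dat_trans$above_insurance==1 & dat_trans$loanlink==0)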
Task: Run a regression similar to reg1
:
- Use the binary.glm()
function
- Take the explanatory variables as in reg1
but replace above_insurance
with uninsured_no_rel
and uninsured_rel
and regress them on runner75
- Show the dropped variables.
- Store your results in reg3
#< task_notest
# Only adjust the ??? with the mentioned variables. Add them in the same order as mentioned!
#reg3=binary.glm(runner75~minority_dummy+???+???+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,link="probit",data=dat_trans,show.drops=TRUE)
#>

reg3=binary.glm(runner75~minority_dummy+uninsured_no_rel+uninsured_rel+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,link="probit",data=dat_trans,show.drops=TRUE)

#< hint
hint("Just copy the regression formula from reg1 and replace as mentioned. The other parameters are: link=\"probit\",data=dat_trans,show.drops=TRUE")
#>
Look at the output of the previous code chunk. The first entry shows that uninsured_rel
predicts runner75
=0 perfectly: whenever the variable uninsured_rel
takes the value one, the variable runner75
is always zero. In order to better understand this result, we compute the sum of runners for each possible combination of above_insurance
and loanlink
.
Just click on check
, to get the mentioned computations.
#< task
summarise <- dplyr::summarise
summarise(group_by(dat_trans, above_insurance, loanlink), num.runners = sum(runner75))
#>
From the first exercise, we know that we have 307 runners. These runners are grouped as follows: 259 depositors who are under the insurance cover and have no loan linkage run. Seven depositors with a loan linkage and under the insurance cover run. Of the depositors above the insurance cover, 41 run if they have no loan linkage. For depositors who are above the insurance cover and have a loan linkage, we get a surprising and interesting finding: there are no runners in this group at all. This highlights the importance of a loan for the running decision.
If we didn't drop the variable uninsured_rel
and estimated its coefficient, it would be unusually large in magnitude. In addition, the results of the paper could then not be replicated.
Click on check
, to validate the statement.
#< task
reg3.1=glm(runner75~minority_dummy+uninsured_no_rel+uninsured_rel+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
coef(reg3.1)["uninsured_rel"]
#>
With a value of -2.34, the coefficient is very large in magnitude and shifts the running probability close to zero whenever the variable uninsured_rel
equals one. This large coefficient has its origin in the chosen estimation method and can be explained intuitively: the ML estimation maximizes the probability of the observed sample. If there is one variable that predicts the dependent variable perfectly, the likelihood can be increased most by scaling up this variable. For this reason the related coefficient is pushed as far as possible in the direction of perfect prediction.
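To see this behavior in isolation, consider a small, purely artificial example of perfect prediction (complete separation). The data below are made up for illustration and have nothing to do with the bank-run dataset:

# Artificial example: x separates y perfectly (y = 1 exactly when x = 1).
set.seed(1)
toy = data.frame(y = c(rep(0, 50), rep(1, 50)),
                 x = c(rep(0, 50), rep(1, 50)),
                 z = rnorm(100))
# glm will typically warn that fitted probabilities of 0 or 1 occurred;
# the coefficient of x is driven towards +/- infinity (a very large number in practice).
sep.fit = glm(y ~ x + z, family = binomial(link = "probit"), data = toy)
coef(sep.fit)["x"]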
Task: Use the raw function effectplot()
to visualize your estimates of reg3
. Set the heading to main="Change in running probability\n"
.
Previous hint: If this task seems to be too tricky, look at the info-block of effectplot
and copy the command. Afterwards, do your adjustments!
effectplot(reg3,main="Change in running probability\n")

#< hint
display("Did you forget to use the variable org_dat?")
#>

#< add_to_hint
display("Type: effectplot(reg3,main=\"Change in running probability\n\")")
#>
What you can see here is very telling:
the effect of uninsured_no_rel
shows that if a depositor has no loan linkage and is above the insurance cover, the running probability rises dramatically. This highlights the importance of the insurance cover, which remains the largest effect.
We drop the variable uninsured_no_rel
to visualize the effects of the other variables in more detail. Further, we show a 95% confidence interval for each effect. You only have to click on the check
-button to display the plot.
#< task
effectplot(reg3,ignore.vars="uninsured_no_rel",show.ci=TRUE)
#>
The smaller the confidence interval, the more precisely the effect is estimated. In general, a confidence interval at the 5% significance level tells us in which area the effect lies with 95% probability. For example, the estimated effect of the opening_balance
variable lies in a relatively small confidence interval, in the close neighborhood of 0.75%. This supports our view that the opening balance indeed has a significant effect on the running decision. If we look at the effect of avg_transaction
, the confidence interval ranges from a negative value up to a positive one. We cannot be sure whether the estimated effect is larger than zero, and therefore we don't judge it as an important factor for the running decision.
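As a reminder of where such an interval comes from: for an approximately normal estimator, the 95% confidence interval is the point estimate plus/minus roughly 1.96 standard errors. A minimal sketch, assuming reg3 behaves like a standard glm object (note that this interval is on the probit index scale, not on the marginal-effect scale shown by effectplot):

# 95% confidence interval for one coefficient, built from estimate and standard error.
est = summary(reg3)$coefficients["opening_balance", "Estimate"]
se  = summary(reg3)$coefficients["opening_balance", "Std. Error"]
c(lower = est - qnorm(0.975) * se,
  upper = est + qnorm(0.975) * se)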
As always, we load the dataset on which we base our analysis.
The loading will be done automatically, but the download itself has to be done manually. So download the dataset "data_for_survey.dat"
into your current working directory.
After that, you only need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)

# data for the second task
dat_survey=read.table("data_for_survey.dat")
#>
Now we have some very interesting findings: a loan relation does not only reduce the running probability of depositors in general (first regression), it also keeps the uninsured depositors away from running (third regression).
This may have three reasons:
1. Depositors think that their outstanding loan is offset by their deposits.
2. Depositors get information about the true health of the bank and thus don't run.
3. There may be some socio-economic reasoning, e.g. the wealth.
The first thought can be discarded because in India it isn't allowed to offset outstanding loans against deposits in case of a default. The second reason sounds very interesting and can be tested easily with our dataset.
In this sub-chapter we want to check the hypothesis that a loan relation is a source of information and therefore creates some information value. To do so, we look at depositors who had a loan before the bank run and at those who will have a loan in the future.
Therefore we introduce a set of new variables:
loanlink_before
, loanlink_current
and loanlink_after
.
Look at the description to get more information.
We will first run a regression without the variable loanlink_after
. In a second regression we include this variable and measure whether loanlink_after
has an effect on the coefficients of the other variables.
To check whether a loan is some kind of information source, recall that only a current loan or a loan in the past can be a source of information: normally, if one has a loan, one has to appear at the bank at certain points in time to talk to a loan officer or to renegotiate one's loan conditions.
A future loan does not have this feature of a current relation. Prospective borrowers only have to fulfil the mentioned obligations in the future (a sketch of how these three dummies could be derived from loan dates follows after the list). A future loan is measured by the variable:
- loanlink_after
Having a current outstanding loan is measured by:
- loanlink_current
A loan link in the past is measured through:
- loanlink_before
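The dataset already contains these three dummies. Purely as an illustration of the idea, the following sketch shows how they could be derived if one had a loan start and end date for each depositor. The columns loan_start and loan_end and their values are hypothetical and not part of the actual dataset; only the run date of March 13, 2001 comes from this problem set:

# Hypothetical loan records (made-up dates), just to illustrate the coding rule.
run_date = as.Date("2001-03-13")   # run date considered in this problem set
loans = data.frame(loan_start = as.Date(c("1999-05-01", "2000-11-20", "2001-06-01")),
                   loan_end   = as.Date(c("2000-02-01", NA,           NA)))
# loan that ended before the run:
loans$loanlink_before  = as.numeric(!is.na(loans$loan_end) & loans$loan_end < run_date)
# loan outstanding at the run date:
loans$loanlink_current = as.numeric(loans$loan_start <= run_date &
                                    (is.na(loans$loan_end) | loans$loan_end >= run_date))
# loan taken out only after the run:
loans$loanlink_after   = as.numeric(loans$loan_start > run_date)
loans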
Task: Run a regression similar to reg4
. Add first the variable loanlink_after
, then ward
and last travel_costs
. Further, set clustervar="adress"
and show.drops=FALSE
. Store your result in the variable reg5
.
Previous hint: You see that there is already a command in your chunk. This command is part of the solution and mustn't be deleted.
#< task
# The first regression is done for you, to avoid long typing.
reg4=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction,link="probit",data=dat_trans,show.drops=FALSE)
#>

#< task_notest
# Only replace the ??? with the mentioned variables. Add them in the mentioned order!
#reg5=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction+???+???+???,link="probit",data=dat_trans,clustervar="adress",show.drops=???)
#>

reg5=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction+loanlink_after+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=FALSE)
Task: Now use the showreg()
command to show the results of reg4
and reg5
:
Calculate robust standard errors according to HC0 and show the marginal effects.
Round to the 4th decimal place by setting digits=4
and don't show the Intercept and the ward dummies.
Previous hint: Look at the info-block of showreg
!
showreg(list(reg4,reg5),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=4,omit.coef="(Intercept)|ward")

#< hint
display("Remember that some of the input parameters of showreg need to be written in a vector, e.g.: robust=c(TRUE,TRUE). For further help, look at the info-block.")
#>

#< add_to_hint
display("Use: showreg(list(reg4,reg5),robust=c(TRUE,TRUE),robust.type=\"HC0\",coef.transform=c(\"mfx\",\"mfx\"),digits=4,omit.coef=\"(Intercept)|ward\")")
#>
What we see here confirms exactly our conjecture that a loan linkage has an information value: the effect of a future loan linkage (loanlink_after
) is very small and not significant, but the effects of the past loan (loanlink_before
) and the current loan (loanlink_current
) are larger and much more significant. We can conclude that a future loan has no influence on the decision to run or not to run, because a depositor doesn't gain any additional information out of a future relation.
The value of the information may come from the conversations of the loan officer with the depositor, or from the fact that the depositor has to go to the bank more often than other depositors without a loan and thus has more chances to get information about the bank's health.
Further, we can now explain the coefficient of the ln_accountage
variable: the older the relation between the bank and the depositor, the more information can be gained about the health of the bank. This leads to higher trust in the bank and keeps the depositor from running.
In the prevailing banking literature, the importance of the bank-depositor relationship is highlighted. For example, in Goldstein and Pauzner (2005), depositors receive noisy signals about the health of the bank. We can now add that depositors who had a loan at the bank receive more informative signals. A reason for this might be the interaction with the related loan officer. As Diamond and Dybvig (1983) point out, a bank run depends on the depositors' belief in the ability of the bank to make the promised payments. The trust in a bank might therefore be fostered through a loan, making this bad equilibrium less likely. Finally, we could conjecture that a depositor is afraid of losing a potential source of financing: depositors with a loan linkage might have less incentive to run in order not to risk the financing of future projects.
In this sub-chapter we check whether the last thought, that the running behavior is influenced by socio-economic factors, can explain the loan-relation effect. Therefore we need more detailed information about the depositors than we currently have. This detailed information can be gained through a survey containing a list of questions regarding the socio-economic background of a single depositor.
To get a representative sample, one has to choose the observations randomly. In our case, 100 depositors who withdrew from their transaction account and 300 depositors who didn't withdraw were selected. These depositors all belong to different households, so that there won't be correlations between the observations (no clustering is needed). Despite all this, only 282 depositors could be visited, because the interviewer didn't meet all of them on the day of the survey.
In the survey, the depositors had to make statements about their property: they were asked whether they own an apartment, car, bike or land. Depending on the answers, a variable wealth was constructed, which sums up the relative part of the named items that a depositor holds. One could assume that the more a depositor owns, the more he or she is harmed by a default of the bank and thus runs very early. Also the depositors' age and their education are taken into account. Education was measured as follows: high school, bachelor degree or master degree. One could argue: the better the depositor is educated, the more realistic is his or her estimate of the health of the bank. Moreover, the depositors were asked whether they hold stocks. This could be an indicator for a depositor to run because he or she faced large losses from stock investments. The other questions were similar to our first dataset from Exercise 1.1 b).
To keep the focus on the socio-economic background of a depositor, we select some variables of interest: the depositor's age, measured in years, the amount of stocks he or she holds, and his or her wealth. The wealth is measured as follows: the depositor is asked whether he or she has a bike, land or an apartment. Each of the three asset indicators is expressed as a share of the total amount of that asset held by the depositors in sum, and these three ratios are then added up to form the wealth variable. A small illustrative sketch of this construction is given below.
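This sketch shows only one possible reading of that construction; the indicator names has_bike, has_land and has_apartment are hypothetical, the data are made up, and the actual weighting used for the survey variable may differ:

# Illustrative wealth index: each asset indicator divided by the total number of
# depositors holding that asset, then summed over the three assets.
survey_demo = data.frame(has_bike      = c(1, 0, 1, 1),
                         has_land      = c(0, 0, 1, 0),
                         has_apartment = c(1, 1, 1, 0))
survey_demo$wealth_demo = with(survey_demo,
  has_bike / sum(has_bike) + has_land / sum(has_land) + has_apartment / sum(has_apartment))
survey_demo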
Task: Use the select()
function to extract the variables runner75
, stock
, age
, education
, wealth
, education_dummy1
and education_dummy2
out of dat_survey
and store it into subset2
.
Previous hint: In the last task of Exercise 1 you already did a similar command.
The select()
command is part of the dplyr
-package. It extracts certain columns out of the given dataset.
It thus returns a subset of the original dataset.
For the concrete use, we provide a code example.
select(mydata,col1,col2,col3)
In general, there are various ways to select subsets of data. A useful page in general is the QuickR page. In our specific case of data-selection, look at: statmethods.net/management/subset.html
subset2=select(dat_survey,runner75,stock,age,education,wealth,education_dummy1,education_dummy2)

#< hint
display("Look at the second exercise if you want further examples on how to use select.")
#>

#< add_to_hint
display("Only type: subset2=select(dat_survey,runner75,stock,age,education,wealth,education_dummy1,education_dummy2)")
#>
When we want to explain the running behavior with socio-economic factors, we have to think about the different characteristics of a single depositor and how they might affect his or her running decision.
We first look at the stocks. Stocks may indicate a potential liquidation pressure of deposits to offset losses resulting from the stock markets. This might be a reason for depositors with stocks to run immediately. This effect is measured by the variable:
- stock
Also the wealth of a depositor could matter.
This is measured by:
- wealth
The education of a depositor may give an indication of how well he or she can assess the situation. The more educated a depositor is, the more realistic should be his or her picture of the bank's health. One could argue that the better the education of a depositor, the higher the likelihood that he or she stays at home, because in our case the bank is solvent. We take this into account using the variables:
- education
- education_dummy1
- education_dummy2
Last, we take the age into our analysis. Imagine someone very old having only a few years left; knowing this, would he or she run? The age is accounted for by:
- age
Now after the loading, we group the observations into runners and stayers and check whether there are significant differences. We want to answer the question whether socio-economic reasons may influence the running decision. If there is an influence, there should be some discriminating power of these socio-economic factors.
Therefore, you will now develop the function called TTest
, which you already used before.
This function is used in combination with sapply(data, function(x))
. Recall that sapply
applies the given function to each of the columns of the underlying dataset.
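As a quick reminder of how sapply works column-wise, here is a toy example with R's built-in mtcars data, unrelated to our bank-run dataset:

# Apply a function to each column of a data frame; sapply returns a named vector.
sapply(mtcars[, 1:3], mean)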
Task:
Write a function called TTest. Type the commands step by step, as mentioned in the task here:
- to create the function, write: TTest=function(x,y) {
- in the next line, write: output=t.test(x~y)[c("estimate","p.value","statistic")]
- now make one list out of the fractions and write into a new line: output=unlist(output)
- the return value doesn't have to be marked, only type into the next line: round(output,3)
- finally, close your function by writing into the next line: }
TTest=function(x,y) {
  output=t.test(x~y)[c("estimate","p.value","statistic")]
  output=unlist(output)
  round(output,3)
}

#< hint
display("Did you choose output to store the results of the t.test? For the second step write output=unlist(output)")
#>
Task: Use the sapply()
function to perform a t-test. Group the input data on the variable runner75
.
As data input use: subset2[-subset2$runner75]
.
Previous hint: If you can't remember how to use the function, look at the last task of Exercise 2. Delete the # in front of the given code and then adjust it.
#< task_notest
# Only adapt the ???
# t(sapply(???,function(x) TTest(x,subset2$runner75)))
#>

t(sapply(subset2[-subset2$runner75],function(x) TTest(x,subset2$runner75)))

#< hint
display("Look at the last task of Exercise 2 to remember the syntax")
#>

#< add_to_hint
display("Just type: t(sapply(subset2[-subset2$runner75],function(x) TTest(x,subset2$runner75)))")
#>
The output shows that the group means are very similar for all variables. On average, there is thus no clear trend in either direction of the decision: runners as well as stayers seem to have the same socio-economic properties. Moreover, none of the variables is significant even at the 5% level. Putting all this together, it looks as if socio-economic factors don't have an impact on the run-stay decision.
To test this assumption more formally, we run a probit regression to measure the changes in the running probability in dependence on these factors.
Task: Use the binary.glm()
function to regress runner75
like in reg6
. Also add the variables wealth
, stock
and age
to the regression formula and store your results in the variable reg7
.
Previous hint: Just copy the command and then do the adjustments. Don't delete the given example, it's part of the solution.
#< task
# Copy the code below and then do your adjustments. Add the mentioned variables at the end of the regression formula in the given order!
reg6=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2,data=dat_survey,link="probit",show.drops=TRUE)
#>

reg7=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2+wealth+stock+age,data=dat_survey,link="probit",show.drops=TRUE)

#< hint
display("The regression formula is the first input parameter of binary.glm. So only add the variables in the order they were mentioned.")
#>

#< add_to_hint
display("Type: reg7=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2+wealth+stock+age,data=dat_survey,link=\"probit\",show.drops=TRUE) ")
#>
What we see is that loanlink
predicts the variable runner75=0
perfectly. This means that none of the surveyed depositors with a loan ran. Bear this in mind for our interpretation!
In order to not bore you by always typing in the same commands, we directly show the output.
So you only have to press check
.
#< task
showreg(list(reg6,reg7),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
It is remarkable that loanlink
predicts the behavior "staying at home" perfectly.
This underlines the importance of a loan linkage, which is thus independent of socio-economic factors.
Also, being above the insurance cover shifts the running probability by more than 60%, which is enormous. Moreover, this coefficient is highly significant.
Regarding the socio-economic factors, we observe the following:
Stock investments don't have a significant influence on the running probability, which means that the depositors' decision isn't due to a liquidity shock caused by stock losses.
Also the age, education and total wealth don't seem to influence the running probability, which makes our findings on the loan linkage and the insurance cover robust to controlling for age, wealth and education.
We now step back to Exercise 2, where we stated that the decision to run depends on the information a depositor has about the fundamentals of the bank. This information can be gained from internal sources, such as a direct relation to the bank, or through external sources, such as contacts with other depositors. So the decisions of other depositors could influence someone's decision whether to run or not. To measure the effects of social networks, we have to structure these external sources. First, we measure the so-called introducer network: a common requirement for banks in India is that a depositor who wants to open an account has to be introduced by a depositor who already has an account at the bank. The purpose of this requirement is to identify the new depositor, as India had no common social security number. We therefore assign all depositors who have the same introducer to one network. Second, we measure the neighborhood network by looking at the ward in which a depositor lives: all depositors living in the same ward have the same value of the ward variable. (A sketch of how such network variables could be computed is given right after this paragraph.)
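The variables social_runners and ward_runners used below are already part of the dataset. Just to illustrate the idea, the following sketch shows how one could compute, for each depositor, the number of other runners in his or her ward (assuming dat_trans is loaded as in the chunk below). The analogous computation for the introducer network would group by a hypothetical introducer-id column instead, and the actual variable in the dataset may be defined as a fraction rather than a count:

library(dplyr)
# For every depositor: number of runners in the same ward, excluding the depositor himself.
dat_demo = dat_trans %>%
  group_by(ward) %>%
  mutate(ward_runners_demo = sum(runner75) - runner75) %>%
  ungroup()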
We proceed as in every exercise and first of all load the dataset on which we base our analysis.
These loadings will be done automatically so that you only need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
#>
In this sub-chapter, we try to find a pattern that shows the relation between runners and wards.
Maybe we can see that all runners come from a specific ward and could therefore assume that the running decision is influenced by the decisions of the depositors living in the same ward.
To get a first overview and an intuition for the idea that a network influences the decision to run, we do the following:
Task: Apply group_by
to the dataset dat_trans
. Group by the variable ward
and store your result in dat_ward
.
Previous hint: If you forget how to use the function, look at Exercise 2.
dat_ward=group_by(dat_trans,ward)

#< hint
display("Look at Exercise 2.3. The first info-block contains all information.")
#>
Task: Use the summarise
function to sum up the runners in each ward. Therefore set the first input parameter to dat_ward
. Store your results in the variable ward_runner
.
Previous hint: Delete all the # before the green inked code and work with the given commands.
#< task_notest
# replace the ??? with the mentioned function
# summarise <- dplyr::summarise
# ward_runner=summarise(???,SumRunner=sum(runner75))
#>

summarise <- dplyr::summarise
ward_runner=summarise(dat_ward,SumRunner=sum(runner75))

#< hint
display("Only type: ward_runner=summarise(group_by(dat_trans,ward),SumRunner=sum(runner75))")
#>
Now, after you've summed up the runners in each ward, we should think about how the depositor's location, measured by the ward, influences the running decision. Notice that the ward variable could be constructed as follows: the city is viewed from a bird's perspective, as a Cartesian coordinate system. This means that we divide the city into squares and give every square a number, starting from the top left to the bottom right. If we now observed many runners in one ward and some runners in the neighboring ward, we could assume that there is some information spreading around the ward, which affects people in the surrounding area.
Task: Use ggplot()
to draw a graph as in the example.
Use ward
as the x-axis and SumRunner
as y-axis.
Previous hint: Delete all the # before the commands and directly work with them.
#< task_notest
# Just replace the ??? with the mentioned variables.
# ggplot(ward_runner,aes(x=???,y=???,fill=factor(ward)))+
#   geom_bar(stat="identity")+
#   theme(legend.position="none")+
#   ggtitle("Sum of Runners in a certain Ward\n")
#>

ggplot(ward_runner,aes(x=ward,y=SumRunner,fill=factor(ward)))+
  geom_bar(stat="identity")+
  theme(legend.position="none")+
  ggtitle("Sum of Runners in a certain Ward\n")
If you look at the graph, you see that the runners are concentrated around the large bars. Each bar represents a ward and is shown in a different color. The shape roughly looks like a Gaussian curve with the respective extreme value as maximum.
This pattern reminds us of the following:
the propagation of somebody's information about the health of the bank, and of his running behavior, works like in the game "whisper down the lane". Someone at the start of the lane whispers a statement to his neighbor. The neighbor only understands half of the information and whispers it to his neighbor, who again understands only half of it, and so on.
At the end of the line, the information that arrives is very noisy.
That's why the direct neighbors of a ward are strongly influenced by the behavior of the runners in this ward. The further away we move, the less people are influenced by this behavior.
In a first step, we tried to get an overview of the wards and the runners within each ward.
After having found some interesting patterns, we now look in more detail at how the running probability is influenced by depositors of a certain network.
Therefore we run three regressions:
The first regression is run with the common explanatory variables plus, in addition, only social_runners
.
The second regression uses all common variables plus ward_runners
.
The last regression includes social_runners
, ward_runners
and the common variables used before.
Task: Make use of the glm()
function, to run the third mentioned regression.
Copy the regression formula from reg9
and only add social_runners
at the end of the regression formula.
Store your results in reg10
.
Previous hint: Just copy the command of regression reg9
and then do the adjustments. Don't delete the given example, it's part of the solution.
#< task
reg8=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+social_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)  # only with social_runners
reg9=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)  # only with ward_runners
#>

reg10=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction+social_runners,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)

#< hint
display("The regression formula of reg9 is: runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction. Only add social_runners at the end of the formula.")
#>

#< add_to_hint
display("Type: reg10=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction+social_runners,family=binomial(link=\"probit\"),data=dat_trans,na.action=na.omit)")
#>
We first want to show our regression findings in a table to make them comparable.
Task: Use showreg()
to show all coefficients of the three regressions you calculated above.
Report the marginal effects with robust standard errors according to HC0 for all of the three regressions.
Round to the 5th decimal place by setting digits=5
and don't show the Intercept.
Previous hint: Only delete the # before the green inked command and then adapt it.
#< task_notest
# Only adapt the ???
# showreg(list(reg8,???,reg10),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=???,omit.coef="(Intercept)")
#>

showreg(list(reg8,reg9,reg10),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=5,omit.coef="(Intercept)")
In the first column, we see the estimation results of the regression where we additionally included only the social network. We see that the probability of a depositor running is increasing in the number of runners in the introducer network. Further, the coefficient of the social_runners
variable is the second largest, which highlights its importance. In column two, the regression, which additionally included only the neighborhood network, is displayed. Similar to the social network, a rise in the fraction of running neighbors increases the probability of a run. Moreover, this effect is the largest, even bigger than the effect of the deposit insurance.
In the third column, we take both network variables together and check for the effect on the other variables when we take these two influences together. Both effects are still significant and only decrease a bit.
Our analysis shouldn't end without checking whether our results are robust to certain influences. We thus need to think about factors that could have an influence on our recent findings. We will adapt our probit model according to these factors and check whether our findings remain the same.
Download the dataset data_for_term_deposit_accounts.dat
into your current working directory.
The dataset on which we base our analysis will be read automatically, so you only need to click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
dat_term=read.table("data_for_term_deposit_accounts.dat")
#>
One could argue that our findings depend on the definition of a runner. This is indeed a reasonable objection. But remember our first bar graph, which showed how the sum of runners depends on the withdrawal level: there were no striking differences. The impact of these different definitions on our regression coefficients can be shown if we regress our explanatory variables on these different definitions of a runner.
Task: Copy the given command and only change the dependent variable from runner50
to runner25
.
Store your results in the variable reg12
.
Previous hint: Don't delete the given code. It's part of the solution and will also be tested if you click on the check
button.
#< task_notest
reg11=binary.glm(runner50~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)
#>

reg12=binary.glm(runner25~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)

#< hint
display("This time you have to copy the whole function call and not only the regression formula.")
#>

#< add_to_hint
display("Type: reg12=binary.glm(runner25~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link=\"probit\",show.drops=FALSE)")
#>
Further, one could argue that withdrawals do not only occur at a single point in time.
In our analysis we set the running date to March 13, 2001; earlier withdrawals are not taken into account.
Now we extend the period and define as a runner a depositor who withdraws between March 9 and March 13, 2001. During this period, the following occurred:
on the 9th of March the largest cooperative bank faced a bank run and became insolvent on March 13.
The variable runner75_extended
captures exactly the described effect.
This time you only have to click on the check
button:
#< task
reg13=binary.glm(runner75_extended~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)
showreg(list(reg11,reg12,reg13),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=4,omit.coef="(Intercept)|ward")
#>
As can be seen from the table, the significance levels of e.g. the loan linkage don't change. We can say that our finding of a significant effect of a loan linkage is robust to the definition of a runner; the withdrawal level of the depositors doesn't matter. If we moreover extend the period in which a depositor can withdraw, we don't see any large change in the significance levels. This makes our findings also robust to the choice of the time period.
Arguing only with the significance level involves two aspects. The significance level here is derived from the t-statistic, which is the coefficient divided by the standard error of the coefficient. If the significance level is low, the t-statistic is large. This can be the case either because the coefficient is large (if the significance level increased, we could then say that the effect is larger with respect to the newly defined threshold) or because the standard error of the coefficient is small (which means that the coefficient is estimated very precisely). So if the significance level remains the same, you have to look at the coefficient and the standard error to see where this effect is coming from. A small sketch of this computation follows below.
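As a reminder of the arithmetic, here is a minimal sketch. The values of est and se are made up for illustration; in practice you would take a coefficient and its (robust) standard error from one of the regression tables. Since the probit model is estimated by ML, the statistic is usually treated as a z-statistic:

# z-statistic and two-sided p-value from a coefficient and its standard error.
est = -0.05   # hypothetical coefficient
se  =  0.02   # hypothetical standard error
z   = est / se
p   = 2 * pnorm(-abs(z))   # two-sided p-value under approximate normality
c(z = z, p.value = p)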
So far we have only looked at transaction accounts. But like other banks, our examined bank also offers term deposit accounts. The purpose of these accounts is long-term saving: one makes a contract to leave the money with the bank until a certain date. Usually the interest rate for such accounts is higher than for transaction accounts. If depositors want to withdraw their deposits before the contracted maturity, they don't get the full interest payments; only a fraction minus a penalty is paid. A depositor who has saved his or her money in term deposit accounts therefore faces liquidation costs, which may influence the decision to run. For this reason we look at term deposit accounts and transaction accounts separately.
We now show you the regression outcomes for each table produced in the previous exercises. The only thing to do is to download the dataset; the regressions and the related findings will be produced automatically.
The following subtasks are done automatically.
You only have to click on check
!
#< task_notest
reg14=glm(runner~minority_dummy+above_insurance+opening_balance+ln_accountage+loanlink+ln_maturity,family=binomial(link="probit"),data=dat_term,na.action=na.omit)
dat_term$ward=factor(dat_term$ward)
reg15=binary.glm(runner~minority_dummy+above_insurance+opening_balance+ln_accountage+loanlink+ln_maturity+ward+travel_costs,data=dat_term,link="probit",clustervar1="household_key",show.drops=TRUE)
showreg(list(reg14,reg15),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
We see that the effects are very similar to the findings in 3.3. The three main findings are:
- being above the insurance cover increases the running likelihood
- the higher the opening balance, the higher the running probability
- having a loan linkage and a long relation with the bank decreases the likelihood to run
The only concern seems to be the significance level of the minority_dummy
, which is lower than the corresponding one for the transaction accounts.
Furthermore, we see a variable called ln_maturity
whose sign is negative. This variable measures the distance in days to the contracted maturity. The sign of the coefficient seems intuitive: the further away a term deposit account is from its maturity, the higher the penalty to pay in case of a withdrawal.
#< task_notest
reg16=binary.glm(runner~minority_dummy+opening_balance+ln_accountage+loanlink+ln_maturity+uninsured_rel+uninsured_no_rel,data=dat_term,link="probit",show.drops=TRUE)
showreg(list(reg16),robust=c(TRUE),robust.type="HC0",coef.transform=c("mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
The results are in line with our findings in Exercise 5 for the transaction accounts:
- If a depositor is above the insurance cover and has a loan relation, he or she doesn't run.
- On the other hand, if a depositor is above the cover and has no loan relation, the running probability rises dramatically.
After having found that loan linkages significantly reduce the running probability, we now try to explain this effect.
#< task_notest
reg17=glm(runner~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+ln_maturity,data=dat_term,family=binomial(link="probit"),na.action=na.omit)
reg18=binary.glm(runner~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+loanlink_after+travel_costs+ward+ln_maturity,link="probit",data=dat_term,clustervar="household_key",show.drops=FALSE)
showreg(list(reg17,reg18),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
We get the same findings as in the regressions with the transaction accounts.
Especially the main effects of the loan linkage are quite similar:
- A future loan has no significant impact on the running decision
- A current loan has a negative impact and is highly significant
- A past loan also has a significant and negative influence
We conclude: The depositor-bank relationship may reveal information about the health of the bank and thus keeps the depositor away from running!
Finally, we want to recapitulate our analysis and summarize the most important findings. We find that the insurance cover is the most powerful way to keep a depositor from running: uninsured depositors have a much higher running probability than insured ones. While the insurance cover helps to mitigate a run, it is only partially effective. A second finding is that the length of the bank-depositor relationship and a past or outstanding loan are important factors that prevent the depositor from running. Now remember the third factor:
Final Task: Which factor has a significant impact on the running decision?
- "stocks"
- "age"
- "neighbor_runners"
Assign one of these factors to the variable answer
.
#< task_notest
# Just write one of the mentioned factors
answer="???"
#>
We saw that the more people in the depositor's network run, the more likely is the depositor to run.